ABSTRACT

Sorting isozymes are encoded by single genes, but
the encoded proteins are distributed to multiple
subcellular compartments. We surveyed the predicted
protein sequences of several nucleic acid interacting
sorting isozymes from the eukaryotic taxonomic
domain and compared them with their homologs in
the archaeal and eubacterial domains. Here, we
summarize the data showing that the eukaryotic
sorting isozymes often possess sequences not
present in the archaeal and eubacterial counterparts
and that the additional sequences can act to target
the eukaryotic proteins to their appropriate subcellular
locations. Therefore, we have named these protein
domains ADEPTs (Additional Domains for Eukaryotic
Protein Targeting). Identification of additional
domains by phylogenetic comparisons should be
generally useful for locating candidate sequences
important for subcellular distribution of eukaryotic
proteins.



INTRODUCTION 

Eukaryotes are typified by the possession of organelles, generating
numerous subcellular locations separated from one another by
one or more membranes. Generally the different subcellular
compartments carry out unique biochemical reactions.
However, sometimes the same catalytic activity is found in
more than one subcellular compartment. There are three
different mechanisms used by eukaryotic cells to deliver the
same enzymatic activity to more than one subcellular location.
First, the same catalytic activity may be encoded by dissimilar
genes. For example, cognate mitochondrial and cytosolic
aminoacyl-tRNA synthetases can be quite distinct (1,2).
Second, a catalytic activity may be encoded by multiple similar
genes, each coding an isozyme with unique subcellular distribution.
The yeast genes, ADH1, ADH2 and ADH3 provide an example
of this type of mechanism (3). Finally, a single gene may encode
two or more isozymes with different subcellular distributions.
These proteins are called 'sorting isozymes' and are involved in
many important metabolic processes (for a review see 4,5).

Sorting isozymes must contain information necessary for
protein distribution to different compartments without compromising
catalytic activity. Cellular mechanisms that achieve this
are varied. In some cases, alternative transcriptional initiation
generates mRNAs that encode the catalytic portion with or
without signals for specific compartments. In other cases, the
same end is achieved by alternative translational initiation or
alternative splicing. Finally, post-translational modifications
can also alter the targeting information without altering catalytic
activity (for a review see 4,5). In this report we focus on the
cis-acting signals responsible for sorting isozyme distribution.

Genome sequencing efforts have generated information for
several archaeal (a number are complete and a few others are nearing
completion; TIGR, http://www.tigr.org/tdb/mdb/mdb.html ),
many eubacterial (19 are complete and many others are well
underway), many, many viral and several eukaryotic nuclear as
well as over 100 mitochondrial and 11 chloroplast organellar
genomes (see Entrez Genomes at NCBI,
http://www.ncbi.nlm.nih.gov/Entrez/Genome/org.html ). Indeed, the
sequences of two eukaryotic nuclear genomes are virtually
complete (6,7). If one assumes that sequences important to
catalytic function will be conserved, then comparisons of
eukaryotic sorting isozymes to their counterpart proteins in
non-eukaryotic organisms might reveal the regions of the proteins
serving the sorting function. To test this assumption we conducted
phylogenetic comparisons of five proteins. We chose genes that had
been functionally characterized by cell biology and molecular
biology experiments for their nuclear and mitochondrial targeting
signals and some for cytoplasmic retention/nuclear export signals.
We used three criteria to choose these proteins. (i) At least one
eukaryotic member of the family has been shown directly to be
a sorting isozyme and there is detailed information regarding
the cis-acting sequences involved in subcellular distribution
(Fig. 1). (ii) The catalytic functions are found in phylogenetically
distinct organisms. (iii) The proteins interact with nucleic acids.

Five sorting isozymes that fit our criteria are: (i) Mod5p
catalyzing the modification of A37 to i6A37 on tRNA; (ii) Trm1p
catalyzing the modification of G26 to m2,2G26 on tRNA;
(iii) Hts1p the histidyl-tRNA synthetase; (iv) Cca1p catalyzing
the addition of C, C and A to the 3' ends of tRNAs; and
(v) Ung1p a uracil-DNA glycosylase involved in DNA repair.
Searches of databases demonstrate that eukaryotic counterparts
of these proteins have domains in the same places, that
archaeal/eubacterial counterparts do not. These comparisons,
coupled with previous functional characterization of the protein
domains, in at least one case led us to conclude that the additional
information can serve to direct the eukaryotic proteins to the
appropriate subcellular destination. We have named the
eukaryotic additions ADEPTs (Additional Domains in Eukaryotes
for Protein Targeting). We speculate that identification of
'additional domains' by phylogenetic comparisons and
multiple sequence alignment will provide predictive information
to locate unknown sequences important for the cellular
distribution of eukaryotic proteins. Such analyses might also provide
information for characterizing novel protein targeting motifs.
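
The idea of locating 'additional domains' by comparison across taxonomic domains can be illustrated with a minimal Python sketch. It is an illustration only, not the procedure used for the alignments reported here. It assumes that a gapped multiple sequence alignment is already available as equal-length strings, and it flags runs of alignment columns in which every prokaryotic sequence is gapped while most eukaryotic sequences carry residues. The function name, the thresholds and the toy sequences are all hypothetical.

```python
from typing import Dict, List, Tuple

GAP = "-"


def additional_regions(
    aligned: Dict[str, str],
    eukaryotes: List[str],
    prokaryotes: List[str],
    min_len: int = 10,
    min_euk_fraction: float = 0.5,
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) column ranges that are present in most
    eukaryotic sequences and gapped in every prokaryotic sequence."""
    length = len(next(iter(aligned.values())))
    flagged = []
    for col in range(length):
        euk_present = sum(aligned[name][col] != GAP for name in eukaryotes)
        prok_present = sum(aligned[name][col] != GAP for name in prokaryotes)
        flagged.append(
            prok_present == 0
            and euk_present / len(eukaryotes) >= min_euk_fraction
        )

    regions: List[Tuple[int, int]] = []
    start = None
    # The trailing False acts as a sentinel that closes a run at the end.
    for col, flag in enumerate(flagged + [False]):
        if flag and start is None:
            start = col
        elif not flag and start is not None:
            if col - start >= min_len:
                regions.append((start, col))
            start = None
    return regions


# Toy alignment (invented sequences, not real proteins).
toy = {
    "euk1": "MKRLLSAAAGHT",
    "euk2": "MRRLFSAAAGHS",
    "bac1": "------AAAGH-",
    "arc1": "------AAAG--",
}
print(additional_regions(toy, ["euk1", "euk2"], ["bac1", "arc1"], min_len=3))
```

In the toy alignment only the six N-terminal columns qualify; the single C-terminal column is discarded by the minimum-length filter. Candidate regions found this way would still need functional tests before being called targeting signals.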

METHODS AND EXPLANATION OF ALIGNMENTS 



[...] were compared employing several databases
([...] DDBJ, PDB, SWISS-PROT, PIR, PRF,
dbEST, dbSTS, GSS and HTGS) using the BLAST (8) server
at NCBI (http://www.ncbi.nlm.nih.gov/BLAST/ ). Similar
proteins were identified, retrieved and used to search for
additional matches. The retrieved sequences were aligned
using either Clustal W or X (9,10). The aligned sequences were
adjusted manually and shaded based on the BLOSUM 62
scoring matrix (11) with some weighting based on physical
properties of the amino acids (12).

Table 1 lists the organisms and accession numbers of the
peptides used in the alignments. An expanded version of this
table (Table S1) is available as Supplementary Material at
NAR Online. When the prokaryotic peptides used in the
alignments originate from an incomplete genomic sequence and do
not have an official accession number, the table is linked to the
relevant genome sequencing center. For each of the individual
alignments, not all organisms contain a peptide entry.

The data are presented in two ways. Figures S1-S5, available as
Supplementary Material at NAR Online, show the actual
amino acid sequence alignment information. A score of ≥1
from the BLOSUM 62 matrix is designated as similar while a
score of 0 is considered a weak similarity. Amino acids are
grouped and colored as follows: aromatic amino acids
phenylalanine, tyrosine and tryptophan (FYW) are magenta;
hydrophobic amino acids isoleucine, valine, leucine and methionine
(IVLM) are cyan; charged/polar amino acids aspartic acid,
glutamic acid, glutamine, lysine, arginine, asparagine and
histidine (DEQKRNH) are red; small amino acids glycine,
alanine, cysteine, serine and threonine (GACST) are green;
and proline (P) is blue. Three or more of a given amino acid
yields upper case and color is turned on when at least five of a
given amino acid or three of a given amino acid plus at least
three amino acids from the same group with a score ≥1 are
present. For the consensus lines 17-49% identity results in a
lower case letter, 50-74% identity results in an upper case
letter and 75-100% identity results in an upper case underlined
letter.
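
The grouping and consensus conventions just described can be written down as a short Python sketch. This is a hedged illustration rather than the shading program actually used. The group-to-color table follows the text, with proline as the blue group inferred from the one residue the listed groups leave out. The function names, the choice to compute identity over all sequences in a column (gaps included) and the "." symbol for columns below 17% identity are assumptions made here.

```python
from collections import Counter
from typing import Optional, Tuple

# Residue groups and colors as listed in the text (proline/blue is inferred).
COLOR_GROUPS = {
    "FYW": "magenta",      # aromatic
    "IVLM": "cyan",        # hydrophobic
    "DEQKRNH": "red",      # charged/polar
    "GACST": "green",      # small
    "P": "blue",
}


def residue_color(residue: str) -> Optional[str]:
    """Return the color group of a one-letter residue, or None for gaps
    and unrecognized symbols."""
    for members, color in COLOR_GROUPS.items():
        if residue.upper() in members:
            return color
    return None


def consensus_symbol(column: str) -> Tuple[str, bool]:
    """Return (letter, underlined) for one alignment column.

    17-49% identity -> lower case, 50-74% -> upper case,
    75-100% -> upper case underlined. Below 17% the sketch returns "."
    (an assumed placeholder).
    """
    residues = [r.upper() for r in column if r != "-"]
    if not residues:
        return (".", False)
    letter, count = Counter(residues).most_common(1)[0]
    identity = 100.0 * count / len(column)  # assumption: gaps count in total
    if identity >= 75:
        return (letter, True)
    if identity >= 50:
        return (letter, False)
    if identity >= 17:
        return (letter.lower(), False)
    return (".", False)
```

For example, a column in which three of four sequences carry alanine yields an upper case underlined A, whereas two of six yields a lower case a.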

Figures 2-6 show schematic diagrams of the protein alignments
based on the sequence alignments described above.
Blocks of similar color represent blocks of sequence similarity
and are not a representation of any structural information.
Different colored boxes represent uninterrupted regions of
similarity (at least 35%) between the proteins from different
organisms. Black lines represent eukaryotic sequences not
generally similar to each other. Gray lines represent prokaryotic
sequences not generally similar to each other or the eukaryotic
sequences. Not all the sequences depicted are complete and
some of the eukaryotic peptides judged to be too incomplete
are not shown in the schematic diagrams. Eight eukaryotes
were selected to represent the domain Eukarya: Homo sapiens,
Mus musculus, Caenorhabditis elegans, Plasmodium falciparum,
Schizosaccharomyces pombe, Saccharomyces cerevisiae and
Candida albicans. Plants are usually represented as a
composite diagram due to the lack of complete sequence
information. An I to the right of the schematics designates
incomplete information and a C designates complete cDNA or
genomic DNA sequence information. The lengths of the
polypeptide chains are indicated and where a composite schematic
is shown the lengths of the individual polypeptide chains are
separated by slashes. The eubacterial and archaeal schematics
are derived from consensus sequences and the number of
peptides used to generate the consensus is also indicated.
Where information is available concerning the site of known
exon junctions, the locations of introns are marked with an x.

Figure 1. Location of information for subcellular distribution of sorting
isozymes. Known and presumed targeting signals are represented as colored
boxes. Magenta boxes represent known mitochondrial targeting information.
Teal boxes and blue boxes represent known and presumed (NLS?) nuclear
targeting information, respectively. Purple boxes may target Trm1p to a
subnuclear location and the green boxes in Mod5p may be responsible for the
predominantly cytosolic distribution of this protein. CRD, cytoplasmic
retention domain; NES, nuclear export signal. The black lines represent the
conserved regions of each protein and are not to scale. The subcellular
distributions of the various forms of each protein are also indicated. For
Hts1p, -/+ refers to locations detected upon protein over-expression.
RESULTS AND DISCUSSION

Mod5p homologs and conservation of regions for
subcellular distribution

We previously reported an alignment of Mod5p/MiaA from
33 eubacteria and three eukaryotes (13). Our continued search
for Mod5p homologs has now uncovered Mod5p/MiaA in
45 eubacteria (see Table 1). Two eubacterial organisms do not
contain a miaA gene (Mycoplasma genitalium and Mycoplasma
pneumoniae) while one, Porphyromonas gingivalis, contains
two miaA genes. Seventeen eukaryotic homologs were identified
in fifteen organisms (H.sapiens, M.musculus, Drosophila
melanogaster, C.elegans, P.falciparum, Cryptosporidium
parvum, Leishmania major, Trypanosoma brucei, Arabidopsis
thaliana, Oryza sativa, S.pombe, S.cerevisiae, C.albicans,
Kluyveromyces lactis and Neurospora crassa). Saccharomyces
cerevisiae and C.elegans have only one gene encoding this
protein. Only eight of the eukaryotic Mod5ps are shown in
Figure 2 and the 46 eubacterial MiaA homologs are represented in
Figure 2 as a consensus schematic. The entry for plants in
Figure 2 represents a composite of three A.thaliana homologs
and one homolog from rice. No homologs were identified in
archaea, consistent with the fact that i6A has not been found on
tRNAs isolated from organisms in the archaeal domain
(14,15).

Figure 2. Schematic diagram of Mod5p alignment. Not all of the eukaryotic
homologs are shown in this [...] identified in the archaeal domain. Regions
of uninterrupted sequence similarity (at least 35%) [...] explanations.

By alternative translational starts the S.cerevisiae MOD5
gene encodes two proteins, Mod5p-I and Mod5p-II (16), which
are differentially partitioned between the cytoplasm, mitochondria
and nucleus (17). Mod5p-I is located in the mitochondrial and
cytosolic compartments whereas Mod5p-II is in the cytosol
and the nucleus. Amino acids 1-20 comprise a mitochondrial
targeting sequence (MTS) necessary for distribution of
Mod5p-I to the mitochondria (17).

MTSs are usually located at the N-terminus, contain basic
and hydrophobic amino acids and are predicted to form
amphiphilic α-helices; however, there is no linear consensus
sequence for mitochondrial targeting information (18,19). To
assess whether other eukaryotes may utilize the same strategy as
that for S.cerevisiae, we investigated the N-terminal regions of
the other eukaryotic Mod5 proteins. Five of the eukaryotic
homologs (S.cerevisiae, C.elegans, C.albicans, P.falciparum and
one of the homologs from A.thaliana) contain multiple ATGs
at the beginning of the coding region (Fig. 2), while for most of
the other eukaryotes there is insufficient information available to
predict whether or not multiple translation initiations give rise to
different isozymes. The amphiphilic nature of these N-terminal
peptides was investigated by plotting them on a helical wheel
projection (not shown). In addition to S.cerevisiae, the
C.elegans and C.albicans N-terminal regions resemble other
MTSs (18,19). Thus, we predict that the C.elegans and C.albicans
Mod5 proteins will also be sorted between the cytoplasm and
mitochondria. The N-terminal regions of the P.falciparum
homolog and the A.thaliana homolog with an N-terminal
extension (Fig. S1, Athaippt) do not resemble other MTSs. In
general, the eubacterial proteins do not have this N-terminal
extension, bolstering the idea that this extra domain found in
the eukaryotic proteins is used for targeting.
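
A helical wheel projection of the kind mentioned above can be approximated numerically. The following Python sketch is offered as an illustration, not as the analysis performed here, which was done by inspecting wheel plots. It places each residue at 100 degrees per position (an ideal α-helix) and, as a swapped-in numeric summary, computes a mean hydrophobic moment using the Kyte-Doolittle hydropathy scale. The choice of scale, the function names and the test peptides are assumptions.

```python
import math
from typing import List, Tuple

# Kyte-Doolittle hydropathy values (an assumed choice of scale).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
DEGREES_PER_RESIDUE = 100.0  # ideal alpha-helix, 3.6 residues per turn


def helical_wheel(peptide: str) -> List[Tuple[int, str, float]]:
    """Return (position, residue, wheel angle in degrees) per residue."""
    return [
        (i + 1, aa, (i * DEGREES_PER_RESIDUE) % 360.0)
        for i, aa in enumerate(peptide.upper())
    ]


def hydrophobic_moment(peptide: str) -> float:
    """Length of the vector sum of hydropathies placed around the wheel,
    divided by peptide length. Larger values suggest one face of the
    helix is more hydrophobic than the other."""
    x = y = 0.0
    for _, aa, angle in helical_wheel(peptide):
        h = KD[aa]
        x += h * math.cos(math.radians(angle))
        y += h * math.sin(math.radians(angle))
    return math.hypot(x, y) / len(peptide)


def count_acidic(peptide: str) -> int:
    """Number of aspartate and glutamate residues; MTSs typically have few."""
    return sum(peptide.upper().count(aa) for aa in "DE")
```

A peptide of identical residues spanning whole turns has a moment of essentially zero, while a peptide alternating basic and hydrophobic residues with helical periodicity scores higher. No threshold for calling a sequence an MTS is implied.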

Arabidopsis thaliana has at least three genes predicted to
encode Mod5 proteins; therefore, different genes may well
provide the same catalytic activity to different compartments
for this organism. While additional information concerning
A.thaliana and other eukaryotic organisms will be required to
determine how mitochondrial/chloroplast/cytoplasmic/nuclear
sorting may be achieved, it appears that for the Mod5p family
sometimes one gene codes a catalytic activity found in
multiple compartments whereas in other cases, two or more
genes may code the isozymes.

Nearly all of the eukaryotic Mod5 proteins possess ~50
amino acids at the C-terminus that are not present in the
eubacterial MiaA proteins (Fig. 2). The S.cerevisiae Mod5p nuclear
localization sequence (NLS) maps within this 'additional
domain' (amino acids 408-428; 13). In all of the other eukaryotes
where sufficient sequence information is available (Fig. 2;
S.pombe, C.albicans, C.elegans, rice and one of the A.thaliana
homologs), the C-terminal region is similar leading to the
prediction that they all contain a NLS and that a portion of the
[...] these organisms will [...] be located in the nucleus. Only
one of the three A.thaliana homologs contains this NLS region
while the others lack it (Fig. S1, not shown in Fig. 2), again
suggesting that multiple genes encode differently located Mod5p.

Besides the N-terminal and C-terminal additional domains,
the eukaryotic Mod5 proteins also contain internal domains not
found in the eubacterial homologs (Fig. 2). These internal
additions overlap the region between amino acids 240 and 2*0
that were [...] previously mapped to function in maintenance of the
[...] cytosolic pool (13). As all the eukaryotic
sequences contain a similar region, we predict each of the
[...] also has a cytosolic pool of this protein.

A portion of the S.cerevisiae Mod5p-II resides in the nucleolus
[...] used for nucleolar location has not been
[...] the NLS and MTS, the nucleolus targeting/
[...] resides in motifs absent from the eubacterial
[...] candidate locations for nucleolar targeting
[...] amino acids 303 and 345 and/or 373 and 408.



Trm1p homologs and conservation of regions for
subcellular distribution



Trm1p homologs are found in eukaryotes and archaea, but are
generally not present in eubacteria (Fig. 3). In addition to the
Trm1p homologs that have already been reported (20,21; six
from the archaeal domain, Aquifex aeolicus, S.cerevisiae,
S.pombe, C.elegans and human) our searches revealed three
additional archaeal homologs and incomplete sequences for
mouse, rat, zebrafish, D.melanogaster, P.falciparum,
C.parvum, T.brucei, A.thaliana, rice, Brassica, Zea mays and
C.albicans. There is only a single eubacterial organism,
A.aeolicus, that contains a trm1 gene and this is likely a result
of horizontal transfer (22-24). In agreement with our alignments,
previous studies of tRNA modification have failed to uncover
m2,2G in eubacterial tRNAs (14,15,25).

Eukaryotic and archaeal Trm1 proteins have considerable
sequence similarity. However, like Mod5p, the eukaryotic
proteins contain extra sequence information at the N- and
C-termini and internally. The S.cerevisiae TRM1 gene contains
ATG codons at positions 1 and 17. Human Trm1p contains two
ATGs within the first 37 codons while mouse Trm1p contains
three ATGs within the first 32 codons. Of the eukaryotic genes
that have been sequenced at the N-terminus, only two, from
C.elegans and D.melanogaster, do not have multiple ATGs
within the first 50 codons.
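
Counting in-frame ATG codons near the start of a coding region is a simple scan, sketched below in Python. This is a minimal illustration under stated assumptions: the input is a coding sequence that begins at the first codon, the 50-codon window mirrors the cutoff mentioned in the text, and the function name and toy sequences are hypothetical (they are not the TRM1 sequence).

```python
from typing import List


def inframe_atg_positions(cds: str, max_codons: int = 50) -> List[int]:
    """Return 1-based codon positions of in-frame ATG codons within the
    first max_codons codons of a coding sequence."""
    cds = cds.upper().replace("U", "T")  # accept RNA or DNA spelling
    n_codons = min(max_codons, len(cds) // 3)
    return [
        i + 1
        for i in range(n_codons)
        if cds[3 * i : 3 * i + 3] == "ATG"
    ]


# Toy coding sequence with ATG at codons 1 and 17 (invented, for illustration).
toy_cds = "ATG" + "GCT" * 15 + "ATG" + "GCT" * 5
print(inframe_atg_positions(toy_cds))
```

Only in-frame codons are counted, so an ATG that straddles two codons is ignored. Whether a downstream ATG is actually used for initiation cannot be decided from the sequence alone.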

Some mitochondrial tRNAs of S.cerevisiae are modified by
Trm1p and amino acids 1-48 of the S.cerevisiae Trm1p are
sufficient to target this protein to mitochondria whereas amino
acids 1-16 are not sufficient (26). There are several reports of
m2,2G in mitochondrial and chloroplast tRNAs (27), but
unfortunately the TRM1 genes have not been sequenced for the
organisms demonstrated to contain m2,2G modified mitochondrial
or chloroplast tRNAs. The N-terminus of the human Trm1p
contains no acidic amino acids (Fig. S2) and when projected
upon a helical wheel, it is predicted to have an amphiphilic
structure, characteristic of MTSs (19). Thus, the human gene
could encode a Trm1p that sorts to the mitochondria. The
rodent homologs are very similar to the human in this region
and the C.albicans Trm1p N-terminus contains what appears to
be a very good MTS. As the C.elegans genome contains only a
single TRM1 gene, it is likely that this gene provides the
mitochondrial pool of tRNA (guanine-26,N2,N2) methyltransferase,
if this modification occurs in C.elegans mitochondria.

Figure 3. Schematic diagram of Trm1p alignment. A sequence alignment of all
identified Trm1p homologs can be found in Figure S2. Nine archaeal Trm1
peptides were identified and are represented as a consensus sequence. One
trm1 homolog was identified in the eubacterial domain. The schematic for
plants in this figure is a composite of A.thaliana, [...] and Brassica.
Regions of uninterrupted sequence similarity are shown as cross-hatched
colored boxes. See Methods for additional explanations.

Saccharomyces cerevisiae Trm1p is also targeted to the
nucleus and an efficient NLS resides between amino acids 95
and 102 (28). All the other eukaryotic Trm1 proteins contain
extra sequence information in this same region (Fig. 3, black
region between 103 and 156 of human Trm1p). The C.elegans
(21) and D.melanogaster proteins contain basic amino acids
resembling the simple basic type of NLS in this region (see the
review in 29), perhaps indicating a functional role in nucleus
location. The corresponding extra sequences in human, mouse
and S.pombe are not nearly as basic as the S.cerevisiae Trm1p
sequence and neither a simple nor bipartite basic NLS motif
can be identified in this region. However, it has recently
become apparent that there are multiple nuclear import
receptors in eukaryotic cells that have substrate specificities
not yet delineated (see the review in 30). If the ADEPT regions
of human, mouse and S.pombe Trm1p are used to sort this
protein to the nucleus, as is the case in S.cerevisiae, then
phylogenetic comparisons and sequence alignments may be a
useful means to delineate non-conventional NLS motifs.

The eukaryotic genes also predict a large C-terminal region
and a smaller region (between amino acids 34<S and 367 in
S.cerevisiae) not found in the archaeal proteins (Fig. 3). A zinc
finger is present in the eukaryotic proteins (amino acids 34&-3S7
human Trm1p) that is present in only half of the prokaryotic
proteins. When present in prokaryotic proteins, the finger loop
is much smaller than that found in eukaryotic proteins. The
nuclear pool of Trm1p in S.cerevisiae is located at the inner
surface of the nuclear membrane (28,31). If location at this
subnuclear site is achieved via an ADEPT then we predict that
the targeting information will map to either the large C-terminal or
the smaller upstream eukaryotic additional sequences (Fig. 1,
purple boxes and Fig. 3).

Others (32) have reported results both consistent and
inconsistent with our hypothesis. Deletion of the first 44 amino acids
of S.cerevisiae Trm1p does not influence enzymatic activity,
which is in accord with previous work demonstrating that this
region contains targeting information (26) as well as our
prediction that this region of the other eukaryotic proteins will
supply targeting information. However, a deletion of just five
amino acids at the C-terminus of S.cerevisiae Trm1p causes a
significant reduction in activity (32). This result is inconsistent
with our model in that all of the prokaryotic Trm1 proteins lack
this region and thus it is not expected to influence enzymatic
activity. It is conceivable that an alteration in this region of the
eukaryotic proteins may affect the higher order structure of the
protein and interfere with activity.

Hts1p homologs and conservation of regions for
subcellular distribution

HTS1 encodes histidine-tRNA synthetase, which is known as
HisS in prokaryotes. Forty-five eubacterial and eight archaeal
homologs were identified and 30 eukaryotic homologs were
found. This enzyme is very similar in all three taxonomic
domains (Fig. 4). Signature sequences can be identified that
distinguish the eubacterial and archaeal proteins, and in some
regions the archaeal signature is more similar to that of eukaryotes
than to that of eubacteria.

Six of the eukaryotic homologs contain multiple ATGs in
their 5' regions. However, the majority of the eukaryotic
sequences are incomplete in this region and therefore we are
unable to predict whether they encode proteins that differ at the
N-terminus. In human, there are two genes arranged head-to-head
that code for histidine-tRNA synthetases (Fig. 4,
HumHARS and HumHO3). The proteins encoded by these two
genes are very similar (9<M), except at the N-terminus where
the similarity is only 38%. The N-terminus of HumHARS
(residues 1-17) is acidic whereas that of HumHO3 is not.
Therefore, these two genes could provide the non-mitochondrial
and mitochondrial forms of histidine-tRNA synthetase;
however, this has yet to be determined.

Like Mod5p and Trm1p, where sufficient sequence information
is available, the eukaryotic synthetases contain extra N-terminal
information not present in the eubacterial or archaeal proteins.
This region is precisely where the mitochondrial targeting
sequence has been mapped for S.cerevisiae (33). In the red
algae Porphyra purpurea, a gene for histidine-tRNA
synthetase is present in the chloroplast genome. It is very
similar to the eubacterial genes and does not code an extra
N-terminal region. A nuclear HTS1 gene from A.thaliana that
codes the organellar (mitochondrial and chloroplast)
synthetase has been reported (34). It is more similar to archaeal
genes, however it does code extra N-terminal amino acids.

Both Xenopus oocytes (35) and S.cerevisiae (36)
aminoacylate tRNAs inside the nucleus as well as in the cytosol.
Therefore, there must be nuclear pools of aminoacyl-tRNA
synthetases. If Hts1p indeed possesses information that directs
it to the nuclear interior, the targeting information could be
located in the N-terminal region (Fig. S3, amino acids 20-53 of
HumHARS). The additional sequences in this location in
eukaryotic proteins contain basic residues resembling
conventional NLS motifs (37). Fine mapping of the MTS in this
region has not been completed and it is not yet clear where the
MTS ends and where a putative NLS could begin. The MTS
and NLS signals could also overlap. The majority of the
eukaryotic sequences in this N-terminal region contain a
higher charge density than does the S.cerevisiae sequence.
Alternatively, the information could reside in the additional
information located between amino acids 343 and 366 (S.cerevisiae
numbering). The fungal counterparts are basic in this region
while proteins from other eukaryotes are not.

Eukaryotic aminoacyl-tRNA synthetases tend to be larger
than their prokaryotic counterparts and these extensions tend to
be at the N- or C-terminus (38-41). The prevailing hypothesis
is that these extensions are in part responsible for promoting
the assembly of tRNA synthetase complexes found in eukaryotes
(42). We and others (37) suggest that some portion of the extra
information found in eukaryotic tRNA synthetases may be
responsible for subcellular targeting.

Cca1p homologs and conservation of regions for subcellular
distribution

Organisms in all three domains contain ATP (CTP):tRNA
nucleotidyltransferase activity. However, the archaeal Cca
proteins differ extensively from the eubacterial and eukaryotic
Cca proteins (43). Nevertheless, all possess
'nucleotidyltransferase' motifs. Of the proteins we studied Cca1p
is the least well conserved between eubacteria and eukaryotes. Large
regions of sequence similarity, as found for the other proteins
in our analysis, are lacking in this family. Sixteen eukaryotic
homologs were identified in the following organisms: S.cerevisiae,
S.pombe, C.albicans, human, mouse, rat, C.elegans,
D.melanogaster, A.thaliana, Lupine, rice, Glycine max, L.major,
Brugia malayi and P.falciparum. Eight archaeal homologs and 65
eubacterial homologs were identified. The latter have been
grouped into three classes (Cca, Pap and PcnB) based on the
sequence alignments as well as previous nomenclature. A
consensus schematic is shown for each of these three classes of
eubacterial proteins in Figure 5.

In S.cerevisiae the CCA1 gene encodes three proteins
(Cca1p-I, Cca1p-II and Cca1p-III) that result from differential
translation starts at three in-frame AUGs (44). Eight of the
eukaryotic genes have multiple ATGs in this N-terminal region
(Fig. S4), suggesting that multiple forms of Cca1p could also
be produced by these genes.

Cca1p-I from S.cerevisiae is located primarily in mitochondria
whereas Cca1p-II and Cca1p-III are located both in the cytosol and
the nucleus (45). Like Mod5p, Trm1p and Hts1p the N-terminus of
S.cerevisiae Cca1p contains mitochondrial targeting information.
For each of the other eukaryotes where there is sufficient
information, the eukaryotic Cca1p counterparts have an N-terminal
extension that is absent or different in the eubacterial and
archaeal proteins. This region most likely directs the non-plant
Cca1p to mitochondria. Plant Cca1p should also be directed to
the chloroplast. As chloroplast targeting information also is
usually located at the N-terminus and resembles mitochondrial
targeting information (46; for a review see 47), it is difficult to
predict the function of the plant N-terminal Cca1p extensions.
Also, since no plant genome has been completely sequenced there
could be different genes for mitochondrial and chloroplast CCA
activities.

The location of other targeting information for Cca1p is
unknown, but there are other regions that contain additions not
found in eubacteria (94-103; 109-114 S.cerevisiae numbering).
There are also extensive regions of the proteins that are dissimilar
between eukaryotes and prokaryotes (Fig. 5) that could contain
nuclear targeting information.

Ung1p homologs and conservation of regions for subcellular
distribution

Uracil-DNA glycosylase (UNG or UDG) is a DNA repair
enzyme. The ung gene is found in 33 eubacteria, but is not
present in archaea. Thus, either another gene product supplies
this function or this function is not required. Interestingly, of
the 19 complete eubacterial genomes, the ung gene is absent
from six (Rickettsia prowazekii, Clostridium acetobutylicum,
Treponema pallidum, A.aeolicus, Thermotoga maritima and
Synechocystis), again suggesting that this function may not be
required. Also of note is that within the genus Clostridium one
organism, Clostridium difficile, contains a ung gene while
C.acetobutylicum does not. UNG genes are also present in
some viruses and consensus sequences for the Ung protein
from 23 Herpes simplex viruses and five pox viruses are shown
in Figure S5.

The human homolog of this enzyme is the most thoroughly
studied. BLAST searches revealed Ung homologs in 11 other
eukaryotes. The mouse homolog is very similar to the human
(90% similarity) and both sort this enzyme between the
nucleus and mitochondria via a mechanism that depends on
alternative splicing (48,49; Fig. 6). This mechanism may also
be used in C.elegans as there is an extra 'exon' upstream of the
UNG gene which could be used to supply additional targeting
information. However, this putative exon does not resemble
known MTS or NLS motifs. Disregarding this putative exon
the C.elegans ORF contains four in-frame ATGs. Downstream
of AUG2 there is a sequence resembling a MTS, but we were
unable to identify a classical simple or bipartite-like NLS in the
N-terminal region. In S.cerevisiae there are four methionines
within the first 50 amino acids and alternative transcription or
translation start sites could provide the sorting mechanism for
this enzyme; however, the available data (50; P.Burgers,
personal communication) indicate that Ung1p is solely nuclear
and unlikely to sort to mitochondria in yeast.

Since Ung1p should function within the nucleus of eukaryotes,
there should be information to target this enzyme to the
nucleus. Most of the eukaryotic and viral Ung proteins contain
extra N-terminal sequence information not found in the bacterial
counterparts. The human and mouse nuclear targeting information
resides within this region and S.cerevisiae and P.falciparum
appear to contain conventional bipartite NLSs within this
region.

CONCLUSIONS 

We surveyed five families of proteins containing at least one
confirmed sorting isozyme. Four of these protein families have
members that are highly conserved across taxonomic domains
and the eukaryotic proteins contain additional sequences not
found in the eubacterial or archaeal counterparts. Although the
fifth protein, Cca1p, fits the pattern established by the other
proteins in a limited sense, large portions of this protein are
dissimilar when compared across taxonomic domains.

Additional information can be located at the N- or C-termini
or it can be located internally. The location of additional
sequence information is conserved, but the sequences are not
necessarily similar. It has been proposed that intron locations
correspond to positions separating independent functional
domains of proteins (51,52). Although our data set is limited,
our analysis does not appear to support this view. In general,
ADEPTs do not correspond to genomic spliced regions.

We summarize the evidence that the additional sequences
can encode information to sort the isozymes to appropriate
subcellular locations (Fig. 1). The data lead us to propose the
ADEPT hypothesis that similarly located extra information in
other eukaryotic homologs will serve the same roles in protein
subcellular distribution. We present this type of analysis as a
predictive tool. Our results suggest that phylogenetic
comparison/multiple sequence alignment will be a useful tool
for predicting the cell biological information content of protein
sequences. Future mechanistic tests of the sequences identified
here will be necessary to determine how accurate these
predictions are. However, data to date are quite consistent with
the ADEPT concept.

SUPPLEMENTARY MATERIAL 

See Supplementary Material available at NAR Online. Updates
to the published Supplementary Material will be available at
http://www.collmed.psu.edu/labs/ahopper/DRR/AnEiFra^
sortpaper.htm